pacman::p_load(readxl, reshape, survival, survminer, ggplot2, cowplot, dplyr, My.stepwise, haven, tidyr, knitr, lubridate, naniar, table1, Hmisc, skimr, psych, lavaan)

1. Path Analysis

Structural Equation Modeling (SEM) is a general analytic framework that includes the t-test, ANOVA, MANOVA, multiple regression, path analysis, factor analysis, and growth trajectory models.

1.0 Overall Concepts

  1. Specification

Specify the predictors, outcomes, mediators, and factors. Path diagrams are a useful tool for specification.

  2. Identification
    • the t-rule: \(k = \frac{(p+q)(p+q+1)}{2}+(p+q)\); the number of free parameters must be \(\le k\). Necessary but not sufficient.
    • the Null B rule: \(\mathbf{y_i= \alpha+ By_i+ \Gamma x_i+ \zeta_i}\); when \(\mathbf{B}=0\), there are no relations among the DVs. Sufficient but not necessary.
    • the recursive rule: no feedback or bi-directional effects, no correlated residuals. Sufficient but not necessary.
    • the common sense rule
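As a quick sketch of the t-rule, the amount of observed information can be computed directly (`t_rule_k` is a hypothetical helper, not a lavaan function):

```r
# t-rule: with p endogenous and q exogenous observed variables, the data
# provide (p+q)(p+q+1)/2 unique (co)variances plus (p+q) means.
# Identification requires the free parameter count t <= k
# (necessary, but not sufficient).
t_rule_k <- function(p, q) {
  v <- p + q              # total number of observed variables
  v * (v + 1) / 2 + v     # unique (co)variances + means
}

# Example: 1 outcome (peer) and 1 predictor (coa) yield 2 variances,
# 1 covariance, and 2 means = 5 pieces of observed information.
t_rule_k(p = 1, q = 1)    # 5
```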

Identification asks: is it possible to obtain unique estimates of all model parameters?

Model-Implied Moments: a model is a set of parameters. Population moments include the means and covariances; the model-implied moments are the moment matrices implied by the model, analogous to the relation between y and \(\hat{y}\) in regression. Identification asks: can each parameter be uniquely expressed using the observed information? That is, to what extent is there sufficient known information to infer the unknown values? A model is over-identified when the observed moments outnumber the estimated parameters. Notation: the population moments are \(\Sigma\) and \(\mu\), the sample moments are S and m, and the population model-implied moments are \(\Sigma(\theta)\) and \(\mu(\theta)\).
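The sample moments S and m can be made concrete with a small simulated example (illustrative data, not the afdp file):

```r
# Sample moments: S (covariance matrix) and m (mean vector) summarize the
# observed data; the fitted model tries to reproduce them via
# Sigma(theta) and mu(theta).
set.seed(1)
dat <- data.frame(x = rnorm(100), y = rnorm(100))
m <- colMeans(dat)    # sample means (m)
S <- cov(dat)         # sample covariance matrix (S)
# With 2 variables there are 2*3/2 = 3 unique (co)variances
# plus 2 means = 5 observed moments in total.
```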

  3. Estimation

How do we obtain the best estimates? ML estimates are asymptotically unbiased, consistent, and maximally efficient.

  4. Evaluation
    • R-square: the proportion of variance explained by the model.
    • Chi-square test statistic: compares the tested model with the saturated model, which imposes no restrictions on the means or covariances. A non-significant result indicates good fit, though a poorly fitting model can be accepted simply because there is insufficient power to detect the misspecification. Also used for nested model comparisons.
    • Relative goodness-of-fit indices: TLI, CFI, IFI; values > 0.95 indicate good fit. These compare the hypothesized (tested) model against a baseline model that estimates all means and variances but no covariances.
    • Absolute goodness-of-fit indices: RMSEA (\(\le\) 0.05 is excellent; \(\le\) 0.08 moderate fit; > 0.10 is poor) and the Standardized Root Mean Square Residual, SRMR (\(\le\) .08 is good; it is not penalized for the number of model parameters, so it tends to favor less restricted models).
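As a sketch, RMSEA and CFI can be computed by hand from the chi-square statistics of the tested and baseline models; the values below are the ones lavaan reports for the path model later in this chapter:

```r
# Fit indices from chi-square statistics (a by-hand sketch).
chisq  <- 81.173;  df  <- 8     # tested (user) model
chisqB <- 251.027; dfB <- 18    # baseline model
N <- 316

# RMSEA: misfit per degree of freedom, adjusted for sample size.
rmsea <- sqrt(max(0, (chisq - df) / (df * (N - 1))))
# CFI: improvement over the baseline model.
cfi <- 1 - max(chisq - df, 0) / max(chisqB - dfB, 0)
round(c(RMSEA = rmsea, CFI = cfi), 3)   # RMSEA = 0.170, CFI = 0.686
```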

How well does the model fit the data? The relative fit of competing nested models can be compared using likelihood ratio tests (chi-square difference tests). Regression models are just-identified: they use exactly as many parameters as there are observed moments, reproduce those moments exactly, and therefore necessarily fit perfectly; such models are said to be saturated. Most SEMs are over-identified, resulting in testable restrictions on the model-implied moments.

  5. Potential re-specification
    • Modification Indices (MIs): the expected change in model fit (based on model derivatives rather than actual likelihood ratio tests, but the estimates are very close). An MI is calculated for every restricted parameter. Modifications should be considered one at a time, and they are entirely data-driven.

Can the model be improved? How are modifications identified?

  6. Interpretation

Which effects are significant? Which are substantively important?

  • Disturbance: the unexplained variance of an endogenous variable, i.e., the part that is unrelated to the predictors.

1.1 Dataset Description: Deviant Peer Affiliations

We read the Deviant Peer Affiliations dataset. The header argument of the read.table function is set to FALSE to indicate that there are no variable names at the top of the data file.

afdp <- read.table("data/afdp.dat", header = FALSE ,col.names =c("id","coa","age","gen","stress","emotion","negaff","peer"))
psych::describe(afdp)

peer: adolescent report on peer substance use and peer tolerance of use

coa: parent report of alcoholism diagnosis, where 0 = non-alcoholic and 1 = alcoholic

gen: gender, where 0 = girl and 1 = boy

age: age in years at assessment

stress: self-report measure of uncontrollable negative stressful life events

emotion: self-report measure of temperamental emotional expressiveness

negaff: self-report measure of depression and anxiety

1.2 Simple Regression

peer = α + γ coa + ζ

First, we need to create what’s called a model syntax object. Then we fit the model using the sem function and summarize the results. Note that we include the meanstructure argument in the sem function so that we obtain an estimate of the regression intercept.

model.1 <- 'peer ~ coa'
summary(sem(model.1, data=afdp, meanstructure=TRUE),standardized=TRUE,rsquare=TRUE)
lavaan 0.6-12 ended normally after 9 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           316

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  peer ~                                                                
    coa               0.176    0.060    2.956    0.003    0.176    0.164

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .peer              0.298    0.043    6.884    0.000    0.298    0.554

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .peer              0.280    0.022   12.570    0.000    0.280    0.973

R-Square:
                   Estimate
    peer              0.027

we are using ML (maximum likelihood) estimation with the nlminb optimizer (this is a built-in optimizer in R for finding the minimum of a function).

It is worth pausing here to note that lavaan counts parameters differently than we described in lecture. For this model, we would normally count five parameters, the mean and variance of COA, the intercept and slope of the regression of Peer on COA, and the residual variance of Peer. But lavaan does not count the mean or variance of the predictor, COA, as parameters of the model. Nor does it count covariances among predictors (though there are no such covariances in the current model) as free parameters. Fortunately, lavaan also does not count these parameters when determining the number of observed moments, so the degrees of freedom for the test statistic work out regardless (e.g., here it neither counts the mean and variance of COA as observed moments nor does it count them as estimated moments, leaving a net difference of zero when calculating the degrees of freedom).

This output also provides some information about model fit. Multiple regression models are just identified (i.e., every piece of information provided by the sample is ‘used up’ to estimate model parameters so that no degrees of freedom remain). Therefore, the model fits the data perfectly and it is not worthwhile to interpret the test statistic (this and other fit indices will be discussed in later sections).

The output is subdivided into three parts: regressions, intercepts, and variances. In the Regressions section, the peer ~ coa coefficient is the slope parameter estimate \(\hat{\gamma}\). In the Intercepts section, the coefficient listed for .peer is the intercept \(\hat{\alpha}\). Finally, in the Variances section, the coefficient listed for .peer represents \(\hat{\psi}\), the estimated variance of \(\zeta_i\) (i.e., the residual variance). The Estimate column lists the ML point estimate of each parameter. The Std.Err column gives the standard error for each estimate (computed using the expected information matrix, as noted above). Next, the z-value column reports the Wald z-statistic for the null hypothesis that the parameter is zero in the population, and the P(>|z|) column reports the p-value associated with this z-statistic.

Here, we see that the average non-COA has a score of .298 on peer and the average COA has a score that is .176 units higher than non-COAs on peer (.298 + .176 = .474). Both the intercept and slope are significantly different from zero. Finally, the variance in deviant peer association that is not explained by COA status is .280. This indicates that, although COA status is a significant predictor of peer, COA status does not fully account for affiliation with peers.

Because peer is not on an intrinsically meaningful scale, we cannot easily interpret the differences among the regression parameters or residual variances. To better interpret these results, we can request standardized estimates. Within lavaan, several methods of standardization are available within the summary function. The first type, obtained by including standardized=TRUE, is the typical fully standardized solution. In this case, both the independent and dependent variables are standardized to have a variance of 1 (note that lavaan does not, however, also make the means 0, as one might expect).

The Std.lv column refers to a solution in which only the latent variables are standardized and not the observed variables. Since the current model contains no latent variables, this is of little use here. The Std.all column contains the fully standardized estimates of interest, in which all variables (observed and latent) are standardized to have a variance of one. Thus, we say that a one standard deviation increase in COA status is associated with a .164 standard deviation increase in deviant peer associations.
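The relationship between raw and fully standardized slopes can be verified by hand on simulated data (illustrative, not the afdp file): the standardized slope equals the raw slope multiplied by sd(x)/sd(y).

```r
# Fully standardized slope = raw slope * sd(x) / sd(y); a quick check.
set.seed(2)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
b_raw <- coef(lm(y ~ x))["x"]          # raw (unstandardized) slope
b_std <- b_raw * sd(x) / sd(y)         # rescale by the sd ratio
# Same value as regressing the z-scored variables on each other:
b_z <- coef(lm(scale(y) ~ scale(x)))[2]
all.equal(unname(b_std), unname(b_z))  # TRUE
```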

It may be more useful to retain the original scaling of COA while standardizing peer. If we include std.nox=TRUE in the summary function, as shown below, we will obtain estimates in which the latent variables and endogenous observed variables are standardized but the exogenous observed variables are left in their raw scale. These are commonly known as partially standardized estimates.

Finally, we can also calculate how much variance is explained in peer by coa by requesting that lavaan output the r-squared statistic. R-Square is the estimated proportion of variance in peer that is accounted for by the models (i.e., COA, the only predictor included in this model). We have explained less than 3% of the total variance in peer with the COA predictor. Note, however, that measured variable regression models assume that no measurement error is present. If measurement error is present, the estimated relationship among the variables in the model may be attenuated and the R 2 value will also be underestimated.
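The attenuating effect of measurement error can be sketched with a quick simulation (illustrative data): adding noise to the predictor shrinks both the slope estimate and the R-squared.

```r
# Measurement error in a predictor attenuates the estimated relationship.
set.seed(3)
true_x  <- rnorm(1000)
y       <- 0.5 * true_x + rnorm(1000)
noisy_x <- true_x + rnorm(1000, sd = 1)   # x observed with error

r2_true  <- summary(lm(y ~ true_x))$r.squared
r2_noisy <- summary(lm(y ~ noisy_x))$r.squared
r2_noisy < r2_true   # TRUE: error in x underestimates R^2
```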

1.3 Multiple Regression

peer = α + γ₁coa + γ₂gen + γ₃age + ζ

We have specified that peer is regressed on coa, gen, and age. Note that, again, we use the regression operator ~ to indicate the regression. Now that our model includes multiple predictors, we place the + operator between each predictor.

summary( sem('peer ~ coa + gen + age', data=afdp, meanstructure=TRUE), std.nox=TRUE, rsquare=TRUE)
lavaan 0.6-12 ended normally after 19 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           316

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.nox
  peer ~                                                                
    coa               0.210    0.054    3.858    0.000    0.210    0.391
    gen              -0.048    0.055   -0.877    0.381   -0.048   -0.089
    age               0.151    0.019    7.993    0.000    0.151    0.281

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.nox
   .peer             -1.610    0.250   -6.453    0.000   -1.610   -2.999

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.nox
   .peer              0.232    0.018   12.570    0.000    0.232    0.804

R-Square:
                   Estimate
    peer              0.196

We see that gen is not a significant predictor of peer when COA and age are also included in the model (p = .381). However, age is significantly related to peer such that older adolescents are more likely to associate with deviant peers. We estimate that a one-year increase in age is associated with a .151-unit increase in deviant peer association ratings. COA remains a significant predictor of peer, and its regression parameter estimate has not changed much relative to its standard error (0.06) from the single-predictor model to the multiple-predictor model (it increased from .18 to .21). The stability of this coefficient reflects the fact that COA is not highly correlated with gen or age.

To better interpret these results, and to get a sense for the relative contribution of each predictor, we requested the partially standardized solution. We prefer this method of standardization for this example because all three predictors have natural scales. As can be seen in the Std.nox column of output, after controlling for age and gender, COAs affiliate with deviant peers about .391 standard deviation units more than non-COAs. Each additional year of age is associated with a .281 standard deviation increase in self-reported affiliation with deviant peers.

From the R-square output, we can see that the multiple predictor regression model explains more of the variance in peer than the single predictor model. Age and gender account for an additional 17% of the variance in peer, over and above COA status. Still, approximately 80% of the variance in peer remains unexplained by the variables in our model.

1.4 Path Analysis

1.4.1 Original Model

Theory dictates that alcoholic parents increase the number of stressful life events that their children experience, leading to an increase in child negative affect. Further, children of alcoholics are hypothesized to have higher levels of emotionality, leading to more negative affect. Negative affect is thought to be related to higher rates of affiliation with deviant peers. Stressful life events and emotionality should covary, but we hypothesize no directional relation among these variables. We allow age and gen to predict stress and emotion, and we allow all exogenous characteristics (coa, gen, and age) to covary.

As a default, lavaan allows all exogenous variables to covary but it fixes the residual covariances among endogenous variables to zero. The residual terms of stress and emotion are freed to covary within the section of the code marked #covariances. To designate a covariance (or variance) we use the ~~ operator. Thus, stress ~~ emotion tells lavaan to include a residual covariance for stress and emotion.

The final three lines request the model-implied covariance matrix (fitted(fit)) and the raw and normalized residual matrices (resid(fit, type="raw") and resid(fit, type="normalized"), respectively). These residuals represent the differences between the observed covariance matrix and the matrix that is implied by the structure of the model.

pathmodel.1 <- '
              #regressions
              stress ~ coa + gen + age
              emotion ~ coa + gen + age
              negaff ~ stress + emotion
              peer ~ negaff

              #covariances
              stress ~~ emotion              
              '
fit <- sem(pathmodel.1, data=afdp, meanstructure=TRUE)
summary(fit, standardized=TRUE, rsquare=TRUE)
lavaan 0.6-12 ended normally after 40 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        18

  Number of observations                           316

Model Test User Model:
                                                      
  Test statistic                                81.173
  Degrees of freedom                                 8
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  stress ~                                                              
    coa               0.451    0.072    6.223    0.000    0.451    0.331
    gen              -0.016    0.073   -0.215    0.830   -0.016   -0.011
    age               0.002    0.025    0.078    0.938    0.002    0.004
  emotion ~                                                             
    coa               0.110    0.056    1.963    0.050    0.110    0.110
    gen              -0.048    0.056   -0.843    0.399   -0.048   -0.047
    age              -0.027    0.019   -1.374    0.170   -0.027   -0.077
  negaff ~                                                              
    stress            0.246    0.078    3.134    0.002    0.246    0.175
    emotion           0.553    0.106    5.206    0.000    0.553    0.290
  peer ~                                                                
    negaff            0.176    0.030    5.892    0.000    0.176    0.315

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .stress ~~                                                             
   .emotion           0.112    0.019    5.896    0.000    0.112    0.352

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .stress            0.687    0.333    2.066    0.039    0.687    1.010
   .emotion           2.341    0.258    9.088    0.000    2.341    4.663
   .negaff            1.527    0.207    7.373    0.000    1.527    1.594
   .peer             -0.118    0.091   -1.298    0.194   -0.118   -0.220

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .stress            0.412    0.033   12.570    0.000    0.412    0.890
   .emotion           0.247    0.020   12.570    0.000    0.247    0.979
   .negaff            0.778    0.062   12.570    0.000    0.778    0.848
   .peer              0.260    0.021   12.570    0.000    0.260    0.901

R-Square:
                   Estimate
    stress            0.110
    emotion           0.021
    negaff            0.152
    peer              0.099

We report fully standardized effects (from Std.all column generated by the first summary function call) except for coa, gen, and age, for which we report partially standardized effects (from Std.nox column generated by the second summary function call). Recall that only the outcome variables are standardized in computing the partially standardized effects, which makes them particularly useful for examining the effects of coding variables (e.g., coa and gen) or predictors with natural metrics (e.g., age).

These partially standardized parameter estimates represent the expected change in standard deviation units in y given a one unit increase in x. By comparison, the fully standardized estimates are computed by standardizing both the predictors and the outcome variables so that parameter estimates represent the expected change in standard deviation units in y given a one standard deviation increase in x. These are most useful for interpreting effects when neither predictors nor outcomes have natural scales and when gauging relative effect sizes across predictors on different scales. For covariance parameters, fully standardized estimates can be interpreted as correlations.

From the R-Square section of output, we can also see that only a modest amount of the total variance in any of these variables has been explained by the model.

resid(fit, type="normalized")
$type
[1] "normalized"

$cov
        stress emotin negaff peer   coa    gen    age   
stress   0.000                                          
emotion  0.000  0.000                                   
negaff   0.000  0.000  0.000                            
peer     2.657  0.395  0.000  0.000                     
coa      0.000  0.000 -0.163  2.685  0.000              
gen      0.000  0.000 -1.580 -1.548  0.000  0.000       
age      0.000  0.000  3.229  8.015  0.000  0.000  0.000

$mean
 stress emotion  negaff    peer     coa     gen     age 
      0       0       0       0       0       0       0 

The normalized residuals are rescaled to follow a standard normal distribution and values exceeding plus or minus two are often taken as potentially meaningful in magnitude. The normalized residuals for this model suggest that the hypothesized structure is doing a poor job in reproducing the observed covariances between several variables, most notably between age and peer, between age and negaff, and between stress and peer. We will explore methods for more formally assessing overall model fit in the next chapter.

1.4.2 Evaluation and re-specification

We obtain measures of fit, appropriately enough, by including fit.measures=TRUE

summary(fit, fit.measures=TRUE)
lavaan 0.6-12 ended normally after 40 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        18

  Number of observations                           316

Model Test User Model:
                                                      
  Test statistic                                81.173
  Degrees of freedom                                 8
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               251.027
  Degrees of freedom                                18
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.686
  Tucker-Lewis Index (TLI)                       0.293

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1158.938
  Loglikelihood unrestricted model (H1)      -1118.351
                                                      
  Akaike (AIC)                                2353.875
  Bayesian (BIC)                              2421.479
  Sample-size adjusted Bayesian (BIC)         2364.387

Root Mean Square Error of Approximation:

  RMSEA                                          0.170
  90 Percent confidence interval - lower         0.138
  90 Percent confidence interval - upper         0.205
  P-value RMSEA <= 0.05                          0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.085

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  stress ~                                            
    coa               0.451    0.072    6.223    0.000
    gen              -0.016    0.073   -0.215    0.830
    age               0.002    0.025    0.078    0.938
  emotion ~                                           
    coa               0.110    0.056    1.963    0.050
    gen              -0.048    0.056   -0.843    0.399
    age              -0.027    0.019   -1.374    0.170
  negaff ~                                            
    stress            0.246    0.078    3.134    0.002
    emotion           0.553    0.106    5.206    0.000
  peer ~                                              
    negaff            0.176    0.030    5.892    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .stress ~~                                           
   .emotion           0.112    0.019    5.896    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .stress            0.687    0.333    2.066    0.039
   .emotion           2.341    0.258    9.088    0.000
   .negaff            1.527    0.207    7.373    0.000
   .peer             -0.118    0.091   -1.298    0.194

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .stress            0.412    0.033   12.570    0.000
   .emotion           0.247    0.020   12.570    0.000
   .negaff            0.778    0.062   12.570    0.000
   .peer              0.260    0.021   12.570    0.000

We can request that lavaan suggest changes to our model based on the expected change in model chi-square if a fixed parameter were freed; these are the modification indices (MIs). We can obtain MIs with the modindices function. The sort. argument is set to TRUE so that lavaan will present the largest MIs first. Additionally, the minimum.value argument is set to suppress printing of any MIs less than 10.

modindices(fit, sort.=TRUE, minimum.value=10)

Here, lhs stands for “on the left-hand side of the operator” and rhs stands for “on the right-hand side of the operator.” The operator, denoted op, for these parameters is ~, indicating these are regression slopes. mi denotes the expected improvement in the model chi-square test statistic if the modification is accepted. epc denotes the “expected parameter change (EPC)” and gives the expected parameter value if the modification is implemented (since it is changing from zero). sepc.lv re-scales the EPC by standardizing any latent variables in the model, sepc.all re-scales the EPC by standardizing all variables (observed and latent), and sepc.nox re-scales the EPC by standardizing all variables except exogenous predictors. Here, sepc.lv is equivalent to the EPC because there are no latent variables in the model. It can be helpful to rely on standardized EPC values in order to get a sense of the relative magnitude of each potential modification.

Final Model

pathmodel.5 <-'
              #regressions
              stress ~ coa + gen + age
              emotion ~ coa + gen + age
              negaff ~ stress + emotion + age
              peer ~ negaff + age + coa

              #covariances
              stress ~~ emotion              
              '
fit.5 <- sem(pathmodel.5, data=afdp, meanstructure=TRUE)
summary(fit.5, fit.measures=TRUE)
1.4.3 Mediation Test

A mediator explains why or how one variable exerts an effect on another. The indirect effect point estimate is the product of the coefficients along the chain; its confidence interval can be calculated via the delta method or via bootstrapping.
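As a sketch, the delta-method (Sobel) standard error for a two-path indirect effect a*b can be computed by hand; the numbers below reuse the coa→stress and stress→negaff estimates reported elsewhere in this chapter, purely for illustration:

```r
# Delta-method (Sobel) SE for an indirect effect a*b.
a <- 0.451; se_a <- 0.072    # e.g., coa -> stress
b <- 0.246; se_b <- 0.078    # e.g., stress -> negaff

ind    <- a * b                                  # indirect effect point estimate
se_ind <- sqrt(b^2 * se_a^2 + a^2 * se_b^2)      # first-order delta-method SE
ci     <- ind + c(-1, 1) * 1.96 * se_ind         # normal-theory 95% CI
```

Bootstrapping is generally preferred in practice because the sampling distribution of a product of coefficients is not normal in small samples.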

Significant links in a mediational pathway are not sufficient to infer mediation. In order to formally test the full mediation effect, we need an inferential test of the entire specific indirect effect in question. Here, we want to know whether the specific mediational pathway of COA status on deviant peer affiliation via stressful events and negative affect is statistically significant, whether the specific mediational effect of coa on peer via emotionality and negative affect is significant, and whether the overall indirect effect of coa on peer is significant. Finally, we would like to have an estimate of the total effect of coa on peer considering both direct and indirect pathways.

we’ve added parameter labels for the paths involved in the direct and indirect effects of coa on peer. For instance, in the line stress ~ c_s*coa + gen + age, we have supplied the label c_s for the effect of coa on stress. Likewise, the other paths involved in computing the effects of interest have also been labeled. To help keep track of things, our labels consist of the first letter of the predictor, an underscore, and first letter of outcome, but any labels are fine. We can now refer to these labels when defining direct, indirect, and total effects.

Following the core model specification, we have new lines to define the direct, specific indirect, total indirect, and total effects. This is done using the “define” operator :=. For instance, ind_s := c_s*s_n*n_p requests that lavaan compute the quantity ind_s as the product of the three paths previously labeled c_s, s_n, and n_p.

effects.5 <-'
            #regressions
            stress ~ c_s*coa + gen + age
            emotion ~ c_e*coa + gen + age
            negaff ~ s_n*stress + e_n*emotion + age
            peer ~ n_p*negaff + age + c_p*coa 

            #covariances
            stress ~~ emotion         

            #direct effect
            dir := c_p

            #specific indirect effects
            ind_s := c_s*s_n*n_p
            ind_e := c_e*e_n*n_p

            #total indirect effect
            tot_ind := c_s*s_n*n_p + c_e*e_n*n_p

            #total effect
            tot := c_s*s_n*n_p + c_e*e_n*n_p + c_p
            '
set.seed(62973)
fit.5b <- sem(effects.5, data=afdp, meanstructure=TRUE, se="bootstrap", bootstrap=1000)

Note that we have used the set.seed function to set the random number seed for selecting the bootstrapped samples so that we can reproduce exactly these same results each time we run the script.

Then, in the sem function, we specified se="bootstrap", bootstrap=1000 to request that lavaan compute bootstrapped standard errors with 1000 bootstrapped samples. We actually don’t care much about the standard errors. What we really want are bootstrapped confidence intervals, which are generated by the parameterEstimates function with the boot.ci.type argument. The parameterEstimates function produces a simple list of the parameter estimates from the model. By including the boot.ci.type argument, we request that this output include 95% bootstrapped confidence intervals.

Note that there exist multiple methods for calculating bootstrapped CIs. The percentile method is probably the easiest to understand: the 1000 bootstrapped estimates are ordered by magnitude, and the 25th estimate (2.5th percentile) and the 975th estimate (97.5th percentile) are taken as the confidence limits, yielding a 95% confidence interval. This method, which is widely available in many software programs, is obtained via boot.ci.type = "perc". That is the method we used in the lecture notes. A slightly preferable approach, however, is to use bias-corrected bootstrapped confidence intervals, and these are readily available in lavaan via boot.ci.type = "bca.simple". The bias-corrected intervals are shown in the output below. In terms of null hypothesis testing, the same conclusions are reached with the bias-corrected CIs as with the percentile CIs presented in lecture, but there are some slight differences in the specific values of the confidence limits.
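The percentile method can be sketched by hand on simulated data (illustrative, not the afdp model): resample with replacement, re-estimate, then take the 2.5th and 97.5th percentiles of the bootstrap distribution.

```r
# Percentile bootstrap CI for a regression slope, by hand.
set.seed(62973)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

boot_est <- replicate(1000, {
  i <- sample(200, replace = TRUE)   # resample cases with replacement
  coef(lm(y[i] ~ x[i]))[2]           # re-estimate the slope
})
ci <- quantile(boot_est, c(0.025, 0.975))   # 95% percentile CI
```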

parameterEstimates(fit.5b, boot.ci.type = "perc")
parameterEstimates(fit.5b, boot.ci.type = "bca.simple")

The total effect of coa on peer is equal to .209 and represents a combination of the direct effect (.185) and the total indirect effect (.024). The 95% CI is equal to .094 and .306; because this does not contain zero, the total effect is deemed to be significant.

Examining the mediational pathways, we see that the specific indirect effect coa→stress→negaff→peer, labeled ind_s to indicate it is the indirect effect through stress, is equal to .015 (95% CI=.005, .036) and is significant (does not include zero). The biological pathway from coa→emotion→negaff→peer, labeled ind_e to indicate it is the indirect effect through emotion, is equal to .009 (95% CI=0, .022) and thus does not reach statistical significance (because the lower CI is equal to zero).

2. Factor Analysis

EFA

Latent variables are also called: unmeasured variables, constructs, factors, true scores, unobserved variables, or unobserved heterogeneity. Latent variables are used for: factors in factor analysis, latent variables in structural equation models, basis curves in latent curve analysis, latent classes (class/cluster/mixture models), latent traits, underlying propensity variables for censored/limited dependent variables, missing data, and random effects in a multilevel or mixed model.

Goals of Factor Analysis: determine the underlying structure behind a set of observed measures.

Exploratory factor analysis is primarily data-driven: all observed variables are regressed on all factors, and all regression slopes (loadings) are estimated. Confirmatory factor analysis formally tests a theoretical factor structure: the number and nature of the latent variables, and the relationships between indicators and latent variables, are specified by theory.

Outcome variable is regressed on the predictors. As the target predicted outcome y depends on the predictor x, you can say that “is regressed on” means “is dependent on”. The word “regressed” is used instead of “dependent” because we want to emphasise that we are using a regression technique to represent this dependency between x and y.

\(\Theta\) is residual covariance matrix

Communality. \(R^2\) for regression of the indicator on the set of common factors: items with high communalities are good items, as most of their variance reflects the influence of the common factors \[ h^2=1-\frac{VAR(\varepsilon)}{VAR(y)} \]
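A minimal sketch of the formula (hypothetical numbers; Python used only for illustration):

```python
# Sketch of the communality formula h^2 = 1 - VAR(eps)/VAR(y). On a
# standardized metric (VAR(y) = 1) with a single common factor, h^2 also
# equals the squared standardized loading. Numbers are hypothetical.
def communality(resid_var, total_var):
    return 1 - resid_var / total_var

# An item with unit variance and residual variance 0.40:
print(communality(0.40, 1.0))  # 0.6
```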

Conducting Process

  1. Determine number of latent factors
  2. Rotate factor pattern matrix at chosen number of factors
  3. Eliminate problematic items and refit the model (items with large cross-loadings or low communalities, and redundant items that may otherwise hijack the factor)
  4. Inspect final pattern matrix to determine meaning of latent factors

CFA

Setting the scale of latent variables

  1. set means and variances of the latent variables to zero and one
    • puts latent variables on a standardized scale
    • allows for estimation of all factor loadings (and intercepts)
    • same loading scaling option that is used in EFA
  2. give each latent variable the same metric as one of the observed variables. The intercept and loading for the “scaling item” are set to zero and one

CFA Example dataset description— The data for this demonstration were provided by Holzinger & Swineford in their 1939 monograph A Study in Factor Analysis: The Stability of a Bi-Factor Solution. The sample includes 301 7th and 8th grade students, between 11-16 years of age, drawn from two schools. The data is in the text file hs.dat and the accompanying R-script file is ch04.R. The variables in the data set that we will use are

visperc visual perception test in which participants select the next image in a series

cubes visual perception test in which participants must mentally rotate a cube

lozenges visual perception test involving mental “flipping” of a parallelogram (“lozenge”)

parcomp paragraph comprehension test

sencomp sentence completion task in which participants select most appropriate word to put at the end of a sentence

wordmean verbal ability test in which participants must select a word most similar in meaning to a word used in a sentence.

addition participants have 2 minutes to complete as many 2-number addition problems as they can

countdot participants have 4 minutes to count the number of dots in each of a series of dot pictures

sccaps participants have 3 minutes to indicate whether capital letters are composed entirely of straight lines or include curved lines.

Other variables in the data not included in the models fit here are school, female, age, and month.

hs <- read.table("data/hs.dat", header=FALSE) 

names(hs) <- c("school", "female", "age", "month", "visperc", "cubes", "lozenges", "parcomp", "sencomp", "wordmean", "addition", "countdot", "sccaps")

We rescaled the variables addition, countdot, and sccaps by dividing each by four. This was done to facilitate model estimation, as numerical instability can arise when variables are on widely differing scales; dividing by four brings their standard deviations closer to those of the other observed variables.

hs$addition <- hs$addition/4 
hs$countdot <- hs$countdot/4 
hs$sccaps <- hs$sccaps/4

Finally, we load the lavaan package that we will use to fit the CFA models:

The cfa function in lavaan imposes a number of defaults consistent with how CFA models are typically specified and lets us specify models with compact model syntax objects. The default scaling for the latent variables is a hybrid of the approaches we described in lecture: the latent variables have means set to zero but freely estimated variances, and the factor loading of the first indicator for each latent variable is set to one while the intercept for this indicator is freely estimated. We will override these defaults to implement the scaling options described in lecture. In defining the model syntax object, we use the =~ operator, which defines latent variables (left-hand side) and specifies the variables that load on them (right-hand side). Thus, visual =~ visperc + cubes + lozenges defines a new latent variable, visual, and indicates that visperc, cubes, and lozenges are its indicators.

cfa.1a <-'#factor loadings
          visual =~ visperc + cubes + lozenges
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
         '

fit.1a <- cfa(cfa.1a, data=hs, meanstructure=TRUE, std.lv=TRUE)
summary(fit.1a, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)
lavaan 0.6-12 ended normally after 28 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        30

  Number of observations                           301

Model Test User Model:
                                                      
  Test statistic                                85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               918.852
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.931
  Tucker-Lewis Index (TLI)                       0.896

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -8326.241
  Loglikelihood unrestricted model (H1)      -8283.589
                                                      
  Akaike (AIC)                               16712.483
  Bayesian (BIC)                             16823.696
  Sample-size adjusted Bayesian (BIC)        16728.553

Root Mean Square Error of Approximation:

  RMSEA                                          0.092
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.114
  P-value RMSEA <= 0.05                          0.001

Standardized Root Mean Square Residual:

  SRMR                                           0.060

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual =~                                                             
    visperc           5.398    0.485   11.128    0.000    5.398    0.772
    cubes             1.992    0.310    6.429    0.000    1.992    0.424
    lozenges          5.249    0.595    8.817    0.000    5.249    0.581
  verbal =~                                                             
    parcomp           2.969    0.170   17.474    0.000    2.969    0.852
    sencomp           4.406    0.251   17.576    0.000    4.406    0.855
    wordmean          6.416    0.376   17.082    0.000    6.416    0.838
  speed =~                                                              
    addition          3.562    0.400    8.903    0.000    3.562    0.570
    countdot          3.655    0.330   11.090    0.000    3.655    0.723
    sccaps            6.030    0.585   10.305    0.000    6.030    0.665

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual ~~                                                             
    verbal            0.459    0.064    7.189    0.000    0.459    0.459
    speed             0.471    0.073    6.461    0.000    0.471    0.471
  verbal ~~                                                             
    speed             0.283    0.069    4.117    0.000    0.283    0.283

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .visperc          29.615    0.403   73.473    0.000   29.615    4.235
   .cubes            24.352    0.271   89.855    0.000   24.352    5.179
   .lozenges         18.003    0.521   34.579    0.000   18.003    1.993
   .parcomp           9.183    0.201   45.694    0.000    9.183    2.634
   .sencomp          17.362    0.297   58.452    0.000   17.362    3.369
   .wordmean         15.299    0.441   34.667    0.000   15.299    1.998
   .addition         24.069    0.360   66.766    0.000   24.069    3.848
   .countdot         27.635    0.291   94.854    0.000   27.635    5.467
   .sccaps           48.367    0.523   92.546    0.000   48.367    5.334
    visual            0.000                               0.000    0.000
    verbal            0.000                               0.000    0.000
    speed             0.000                               0.000    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .visperc          19.766    4.090    4.833    0.000   19.766    0.404
   .cubes            18.141    1.628   11.146    0.000   18.141    0.821
   .lozenges         54.037    5.800    9.317    0.000   54.037    0.662
   .parcomp           3.341    0.429    7.779    0.000    3.341    0.275
   .sencomp           7.140    0.934    7.642    0.000    7.140    0.269
   .wordmean         17.454    2.109    8.277    0.000   17.454    0.298
   .addition         26.430    2.691    9.823    0.000   26.430    0.676
   .countdot         12.192    1.855    6.573    0.000   12.192    0.477
   .sccaps           45.857    5.730    8.003    0.000   45.857    0.558
    visual            1.000                               1.000    1.000
    verbal            1.000                               1.000    1.000
    speed             1.000                               1.000    1.000

R-Square:
                   Estimate
    visperc           0.596
    cubes             0.179
    lozenges          0.338
    parcomp           0.725
    sencomp           0.731
    wordmean          0.702
    addition          0.324
    countdot          0.523
    sccaps            0.442
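Because std.lv=TRUE gives the factors unit variance, each indicator's R-square should equal its squared Std.all loading. A quick cross-check against the printed output (Python used only for illustration):

```python
# Cross-check of the output above: with std.lv=TRUE the factors have unit
# variance, so each indicator's R-square equals its squared Std.all loading
# (up to rounding of the printed values).
std_all_loadings = {"visperc": 0.772, "cubes": 0.424, "lozenges": 0.581,
                    "parcomp": 0.852, "sencomp": 0.855, "wordmean": 0.838,
                    "addition": 0.570, "countdot": 0.723, "sccaps": 0.665}
r_square = {"visperc": 0.596, "cubes": 0.179, "lozenges": 0.338,
            "parcomp": 0.725, "sencomp": 0.731, "wordmean": 0.702,
            "addition": 0.324, "countdot": 0.523, "sccaps": 0.442}

for item, lam in std_all_loadings.items():
    assert abs(lam ** 2 - r_square[item]) < 0.005, item
print("squared Std.all loadings reproduce R-square")
```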

The modification indices for the model are obtained from the modindices function as follows:

modindices(fit.1a, sort.=TRUE, minimum.value=10)

Here we see the largest modification indices are associated with a cross-loading for sccaps on visual, and a correlated uniqueness for countdot with addition. It is important to keep in mind that both modification indices may be related to the same misspecification.

Initial Model with Scaling Items: model fit is the same

We shall choose the first indicator for each factor to be the scaling item. The intercept and factor loading for each scaling item are set to zero and one, respectively. This scaling option permits the means and variances of the latent factors to be estimated.

cfa.1b <-'#factor loadings
          visual =~ visperc + cubes + lozenges
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
     
          #means/intercepts
          visual ~ 1
          verbal ~ 1
          speed ~ 1
          visperc ~ 0*1
          parcomp ~ 0*1
          addition ~ 0*1
         '

fit.1b <- cfa(cfa.1b, data=hs, meanstructure=TRUE)
#summary(fit.1b, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)

cfa.2 <- '#factor loadings
          visual =~ visperc + cubes + lozenges + sccaps
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
         '
fit.2 <- cfa(cfa.2, data=hs, meanstructure=TRUE, std.lv=TRUE)
#summary(fit.2, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)

lavTestLRT(fit.2, fit.1a)
Chi-Squared Difference Test

       Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
fit.2  23 16682 16796 52.382                                  
fit.1a 24 16712 16824 85.305     32.923       1  9.586e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
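The chi-square difference above can be reproduced by hand; for a 1-df difference, the upper-tail p-value of a chi-square statistic x is erfc(sqrt(x/2)). A short sketch (Python used only for illustration):

```python
import math

# Reproducing the chi-square difference test above by hand. With a 1-df
# difference, the upper-tail p-value of chi-square statistic x is erfc(sqrt(x/2)).
chisq_fit1a = 85.305  # df = 24
chisq_fit2  = 52.382  # df = 23

diff = chisq_fit1a - chisq_fit2     # 32.923, matching the lavTestLRT output
p = math.erfc(math.sqrt(diff / 2))  # df diff = 1
print(round(diff, 3), p)            # p is about 9.6e-09, as reported
```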
modindices(fit.2, sort.=TRUE, minimum.value=10)

Note that no further changes are suggested for the model.

cfa.3 <- '#factor loadings
          visual =~ visperc + cubes + lozenges 
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps

          #covariances
          addition ~~ countdot
         '
fit.3 <- cfa(cfa.3, data=hs, meanstructure=TRUE, std.lv=TRUE)
#summary(fit.3, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)

lavTestLRT(fit.3, fit.1a)
Chi-Squared Difference Test

       Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
fit.3  23 16682 16797 53.272                                  
fit.1a 24 16712 16824 85.305     32.033       1  1.516e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
modindices(fit.3, sort.=TRUE, minimum.value=10)

This model is not nested with the alternative modified model; however, the two models provide nearly equivalent fit (RMSEA=.065 versus .066, TLI=.948 versus .946, etc.). Thus, model selection should be based on which modification is most plausible, and upon the interpretability of parameter estimates.

3. LSEM

  1. the measurement model is identical to CFA \(y_i= v +\Lambda\eta_i + \varepsilon_i\)
  2. the structural model is similar to path analysis \(\eta_i=\alpha + B\eta_i + \Gamma x_i + \zeta_i\)
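Substituting the structural equation into the measurement equation yields the reduced form (a standard derivation, sketched here; when B = 0 the inverse term drops out):

```latex
% Reduced form of the LSEM: substitute the structural equation into the
% measurement equation and solve for eta_i (assuming I - B is invertible).
\begin{aligned}
\eta_i &= \alpha + B\eta_i + \Gamma x_i + \zeta_i
  \;\Rightarrow\; \eta_i = (I-B)^{-1}(\alpha + \Gamma x_i + \zeta_i) \\
y_i &= v + \Lambda (I-B)^{-1}(\alpha + \Gamma x_i + \zeta_i) + \varepsilon_i
\end{aligned}
```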

Measurement invariance (MI) refers to the extent to which the same measurements conducted under different conditions yield the same measures of the attributes under study. Longitudinal invariance is said to exist if the distribution of the observed variables depends only on the level of the latent variable and not on the time point at which it was measured: f(y|\(\eta\),t)= f(y|\(\eta\))

The latent growth curve captures growth as a latent factor. The latent curve model (LCM) is fundamentally a highly restricted CFA model. The reduced-form equation is \(y_i= v +\Lambda(\alpha+\zeta_i) + \varepsilon_i\)

Time-invariant predictors predict both the intercept and the slope of change through the structural model: add \(\Gamma x_i\), giving \(y_i= v +\Lambda(\alpha+\Gamma x_i+\zeta_i) + \varepsilon_i\)

While time-invariant covariates add predictors to the structural model, time-varying covariates add predictors to the measurement model: add \(Kz_i\), giving \(y_i= v +\Lambda(\alpha+\zeta_i) +Kz_i+ \varepsilon_i\)
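A minimal numeric sketch of the linear LCM (all values hypothetical; Python used only for illustration):

```python
# Minimal numeric sketch of a linear latent curve model,
# y_i = Lambda * (alpha + zeta_i) + eps_i, with four equally spaced
# occasions. All values are hypothetical.
Lambda = [[1, t] for t in range(4)]  # columns: intercept basis, linear slope basis
alpha = [2.0, 0.5]                   # mean intercept, mean slope (illustrative)

def trajectory(zeta, eps):
    """Repeated measures implied by the LCM for one person."""
    eta = [alpha[0] + zeta[0], alpha[1] + zeta[1]]  # person-specific factors
    return [l0 * eta[0] + l1 * eta[1] + e for (l0, l1), e in zip(Lambda, eps)]

# With zeta = eps = 0 we recover the model-implied mean trajectory alpha0 + alpha1 * t.
mean_traj = trajectory([0, 0], [0, 0, 0, 0])
print(mean_traj)  # [2.0, 2.5, 3.0, 3.5]
```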

4. MLM

The lme {nlme} function is flexible in specifying autocorrelation structures.

summary(lme(AIQ~1 + SZ +T1+T2+T3 + SZ:T1 + SZ:T2 + SZ:T3, random = ~T1+T2|ID, data = cog_pre_long, na.action = na.omit,
     correlation = corCompSymm(form =  ~ 1 | ID),method = "ML", 
     control =list(msMaxIter = 1000, msMaxEval = 1000)))
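As a sketch of what corCompSymm(form = ~ 1 | ID) imposes: within each subject, every pair of residuals shares a single common correlation rho (the value 0.3 is hypothetical; Python used only for illustration):

```python
# Sketch of the compound-symmetry structure imposed by
# corCompSymm(form = ~ 1 | ID): within each subject, every pair of residuals
# shares a single common correlation rho. rho = 0.3 is hypothetical.
def comp_symm(n, rho):
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

R = comp_symm(4, 0.3)  # four repeated measures per subject
for row in R:
    print(row)
```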

lmer {lme4}

summary(lmer(AIQ ~ 1 + SZ +T1+T2+T3 + SZ:T1 + SZ:T2 + SZ:T3 + (1+T1+T2|ID), data=cog_dat_long))
---
title: "SEM"
output: 
  html_notebook: 
    toc: yes
    toc_depth: 6
---

```{r manage packages}
pacman::p_load(readxl,reshape,survival,survminer,ggplot2,cowplot,dplyr,My.stepwise,haven,tidyr,knitr,lubridate,naniar,table1, Hmisc,skimr,psych,lavaan)
```

### 1. Path Analysis
Structural Equation Modeling (SEM) is a general analytic framework that includes t-tests, ANOVA, MANOVA, multiple regression, path analysis, factor analysis, and growth trajectories.

#### 1.0 Overall Concepts

1. **Specification**

Predictor, Outcomes, Mediators, Factors. Use **Path Diagrams** for specification

2. **Identification**
    + the **t-rule**: $k = \frac{(p+q)(p+q+1)}{2}+(p+q)$; the number of free parameters must be $\le k$. Necessary but not sufficient.
    + the **Null B rule**: $\mathbf{y_i= \alpha+ By_i+ \Gamma x_i+ \zeta_i}$, when **B**=0, there is no relations among the DVs. Sufficient
    + the **recursive rule**: no feedback or bi-directional effects, no correlated residuals. Sufficient but not necessary.
    + the **common sense rule**

possible to obtain all model parameter estimates?

Model-Implied Moments: a model is a **set of parameters**. Population moments include **means and covariances**. The model-implied moments are the moment matrix implied by the model, which is **analogous to $\hat{y}$ versus y**. Identification asks: can each parameter be uniquely expressed using observed information? That is, is there sufficient known information to infer the unknown values? A model is **over-identified** when the observed moments outnumber the estimated parameters. Population moments are $\Sigma$ and $\mu$; sample moments are S and m; population model-implied moments are $\Sigma(\theta)$ and $\mu(\theta)$.
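The t-rule count above can be sketched numerically (Python used only for illustration):

```python
# Sketch of the t-rule from the identification section: with p + q observed
# variables there are k = (p+q)(p+q+1)/2 + (p+q) observed moments
# (unique (co)variances plus means); free parameters must not exceed k.
def t_rule_k(n_obs):
    return n_obs * (n_obs + 1) // 2 + n_obs

# Example: 4 observed variables supply 10 unique (co)variances + 4 means.
print(t_rule_k(4))  # 14
```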

3. **Estimation**

How to obtain the best estimates. ML is asymptotically unbiased, consistent, and maximally efficient.

4. **Evaluation**
    + **R square**: variance explained by the model
    + **Chi-Square Test Statistic**: compares the tested model with the saturated model, which imposes no restrictions on means or covariances. A non-significant result means the tested model fits well. Note that a poorly fitting model can be accepted simply because there is insufficient power to detect the misspecification. **Used for nested model comparison**.
    + Relative Goodness of Fit Indices: **TLI, CFI, IFI**; **>0.95 is good**. The baseline model estimates all means and variances but no covariances; the hypothesized model is the tested model.
    + Absolute Goodness of Fit Indices: **RMSEA** ($\le$ **0.05 is excellent**; $\le$ 0.08 moderate fit; >0.10 is poor). Standardized Root Mean Square Residual (**SRMR**): $\le$ **.08 is good**; it is not penalized for the number of model parameters, so it tends to favor less restricted models.

How well does the model fit the data? The relative fit of competing models can be compared using likelihood ratio tests (chi-square difference tests). **Regressions are just-identified**: they use the same number of parameters as observed moments and exactly reproduce those moments, so the model necessarily fits perfectly and is said to be **saturated**. Most SEMs are over-identified, resulting in testable restrictions on the model-implied moments.
     
5. Potential **re-specification**
    + **Modification Indices (MIs)**: the expected change in model fit (based on model derivatives rather than actual likelihood ratio tests, but a very close approximation). Calculated for every parameter that is restricted. Take one at a time; entirely data-driven.

Can the model be improved? How are modifications identified?

6. **Interpretation**

Which effects are significant? Which are substantively meaningful?

* Disturbance: unexplained variance of an endogenous variable. the part that is unrelated to the predictors

#### 1.1 Dataset Description: Deviant Peer Affiliations
Read the Deviant Peer Affiliations dataset. The header argument of the read.table function is set to FALSE here to indicate that there are no variable names at the top of the data file.
```{r }
afdp <- read.table("data/afdp.dat", header = FALSE ,col.names =c("id","coa","age","gen","stress","emotion","negaff","peer"))
psych::describe(afdp)
```
**peer** adolescent report on peer substance use and peer tolerance of use

**coa** parent report of alcoholism diagnosis where 0=non-alcoholic and 1=alcoholic

**gen** gender 0=girl 1=boy

**age** age measured in years at assessment

**stress** self report measure of uncontrollable negative life stressful events

**emotion** self report measure of temperamental emotional expressiveness

**negaff** self report measure of depression and anxiety


#### 1.2 Simple Regression

![](figure/SimpleRgression.jpg){width=50%}
peer = α + γ coa + ζ

First, we need to create what's called a **model syntax object**. Then we fit the model using the **sem function**. Note that we include the **meanstructure** argument with the sem function so that we will obtain an estimate of the regression intercept.
```{r}
model.1 <- 'peer ~ coa'
summary(sem(model.1, data=afdp, meanstructure=TRUE),standardized=TRUE,rsquare=TRUE)
```
we are using ML (maximum likelihood) estimation with the **nlminb optimizer** (this is a built-in optimizer in R for finding the minimum of a function).

It is worth pausing here to note that **lavaan counts parameters differently** than we described in lecture. For this model, we would normally count five parameters, the mean and variance of COA, the intercept and slope of the regression of Peer on COA, and the residual variance of Peer. But lavaan **does not count the mean or variance of the predictor**, COA, as parameters of the model. Nor does it count **covariances among predictors** (though there are no such covariances in the current model) as free parameters. Fortunately, lavaan also does not count these parameters when determining the number of observed moments, so the degrees of freedom for the test statistic work out regardless (e.g., here it neither counts the mean and variance of COA as observed moments nor does it count them as estimated moments, leaving a net difference of zero when calculating the degrees of freedom).

This output also provides some information about model fit. **Multiple regression models are just identified** (i.e., every piece of information provided by the sample is ‘used up’ to estimate model parameters so that no degrees of freedom remain). Therefore, the model fits the data perfectly and it is not worthwhile to interpret the test statistic (this and other fit indices will be discussed in later sections).

The output is subdivided into **three parts: regressions, intercepts, and variances**. In the Regressions section, the peer ~ coa coefficient is the slope parameter estimate $\hat{\gamma}$. In the Intercepts section, the coefficient listed for .peer is the intercept $\hat{\alpha}$. Finally, in the Variances section, the coefficient listed for .peer represents $\hat{\psi}$, the estimated variance of $\zeta_i$ (i.e., the residual variance). The Estimate column lists the ML point estimate of each parameter. The Std.Err column gives the standard error for each estimate (computed using the expected information matrix, as noted above). Next, the z-value column reports the Wald z-statistic for the null hypothesis that the parameter is zero in the population, and the P(>|z|) column reports the p-value associated with this z-statistic.

Here, we see that the average **non-COA has a score of .298** on peer and the average **COA** has a score that is .176 units higher than non-COAs on peer (**.298 + .176 = .474**). Both the intercept and slope are significantly different from zero. Finally, the **variance** in deviant peer association that is **not explained by COA status is .280**. This indicates that, although COA status is a significant predictor of peer, COA status does not fully account for affiliation with peers.

Because peer is not on an intrinsically meaningful scale, we cannot easily interpret the differences among the regression parameters or residual variances. To better interpret these results, we can request **standardized estimates**. Within lavaan, several methods of standardization are available within the summary function. The first type, obtained by including **standardized=TRUE**, is the typical fully standardized solution. In this case, **both the independent and dependent variables are standardized to have a variance of 1** (note that lavaan does **not**, however, also **make the means 0**, as one might expect). 

The **Std.lv** column refers to a solution in which **only the latent variables are standardized and not the observed variables**. Since the current model contains no latent variables, this is of little use here. The **Std.all** column contains the fully standardized estimates of interest, in which all variables (observed and latent) are standardized to have a variance of one. Thus, we say that a one standard deviation increase in COA status is associated with a .164 standard deviation increase in deviant peer associations.

It may be more useful to retain the original scaling of COA while standardizing peer. If we include **std.nox=TRUE** in the summary function, as shown below, we will obtain estimates in which the latent variables and endogenous observed variables are standardized but the exogenous observed variables are left in their raw scale. These are commonly known as partially standardized estimates.

Finally, we can also calculate how much variance is explained in peer by coa by requesting that lavaan output the r-squared statistic. **R-Square** is the estimated **proportion of variance** in peer that **is accounted for by the models** (i.e., COA, the only predictor included in this model). We have explained less than 3% of the total variance in peer with the COA predictor. Note, however, that measured variable regression models assume that no measurement error is present. If measurement error is present, the estimated relationship among the variables in the model may be attenuated and the R 2 value will also be underestimated.
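The attenuation point can be sketched with the classical correction-for-attenuation formula, which is not part of the lavaan output above; the reliabilities used are hypothetical (Python used only for illustration):

```python
import math

# Classical correction for attenuation (not from the lavaan output above):
# measurement error deflates an observed correlation by the square root of
# the product of the two reliabilities. All numbers are hypothetical.
def disattenuate(r_obs, rel_x, rel_y):
    return r_obs / math.sqrt(rel_x * rel_y)

# If peer and coa were each measured with reliability .80, an observed
# r of .164 would imply a larger error-free correlation.
print(round(disattenuate(0.164, 0.80, 0.80), 3))  # 0.205
```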





#### 1.3 Multiple Regression
![](figure/MultipleRegression.jpg){width=50%}
peer = α + γ1 coa + γ2 gen + γ3 age + ζ

We have specified that peer is regressed on coa, gen, and age. Note that, again, we use the **regression operator `~`** to indicate the regression. Now that our model includes multiple predictors, we use the + operator between each predictor.
```{r}
summary( sem('peer ~ coa + gen + age', data=afdp, meanstructure=TRUE), std.nox=TRUE, rsquare=TRUE)
```
We see that gen is not a significant predictor of peer when COA and age are also included in the model (p = .381). However, age is significantly related with peer such that older adolescents are more likely to associate with deviant peers. We estimate that a one-year increase in age is associated with a .151-increase in deviant peer association ratings. COA remains a significant predictor of peer, and the regression parameter **estimate (i.e., it has increased from .18 to .21) has not changed much** from the single-predictor model to the multiple-predictor model **relative to its standard error (0.06)**. The **stability** of this coefficient **reflects** the fact that **COA is not highly correlated with gen or age**.

To better interpret these results, and to get a sense for the relative contribution of each predictor, we requested the **partially standardized solution**. We prefer this method of standardization for this example because **all three predictors have natural scales**. As can be seen in the Std.nox column of output, after controlling for age and gender, COAs affiliate with deviant peers about .391 standard deviation units more than non-COAs. Each additional year of age is associated with a .281 standard deviation increase in self-reported affiliation with deviant peers.

From the R-square output, we can see that the multiple predictor regression model explains more of the variance in peer than the single predictor model. Age and gender account for an additional 17% of the variance in peer, over and above COA status. Still, approximately **80% of the variance in peer remains unexplained by the variables in our model**.




#### 1.4 Path Analysis

##### 1.4.1 Original Model
<p float="left">
  <img src="figure/path.jpg" width="50%" />
  <img src="figure/path2.jpg" width="40%" /> 
</p>

Theory dictates that **alcoholic parents** increase the number of **stressful life events** that their children experience, leading to an increase in child **negative affect**. Further, children of alcoholics are hypothesized to have higher levels of emotionality, leading to more negative affect. Negative affect is thought to be related to higher rates of **affiliation with deviant peers**. Stressful life events and emotionality should covary, but we hypothesize no directional relation among these variables. We allow age and gen to predict stress and emotion, and we allow all exogenous characteristics (coa, gen, and age) to covary.
<p float="left">
  <img src="figure/path3.jpg" width="40%" />
  <img src="figure/path4.jpg" width="50%" /> 
</p>

As a **default**, lavaan allows all exogenous variables to covary but it fixes the residual **covariances among endogenous variables to zero**. The residual terms of stress and emotion are freed to covary within the section of the code marked #covariances. To designate a covariance (or variance) we use the ~~ operator. Thus, **stress ~~ emotion** tells lavaan to include a residual covariance for stress and emotion.

The final three lines request the **model-implied covariance matrix (fitted(fit))** and the **raw and standardized residual matrices** (resid(fit, type="raw") and resid(fit, type="normalized"), respectively). These residuals represent the differences between the observed covariance matrix and the matrix that is implied by the structure of the model.
```{r}
pathmodel.1 <- '
              #regressions
              stress ~ coa + gen + age
              emotion ~ coa + gen + age
              negaff ~ stress + emotion
              peer ~ negaff

              #covariances
              stress ~~ emotion              
              '
fit <- sem(pathmodel.1, data=afdp, meanstructure=TRUE)
summary(fit, standardized=TRUE, rsquare=TRUE)
```

We report **fully standardized effects** (from Std.all column generated by the first summary function call) **except for coa, gen, and age**, for which we report partially standardized effects (from **Std.nox** column generated by the second summary function call). Recall that only the outcome variables are standardized in computing the partially standardized effects, which makes them particularly useful for examining the effects of coding variables (e.g., coa and gen) or predictors with natural metrics (e.g., age).

These partially standardized parameter estimates represent the expected change in standard deviation units in y given a one unit increase in x. By comparison, the fully standardized estimates are computed by standardizing both the predictors and the outcome variables so that parameter estimates represent the expected change in standard deviation units in y given a one standard deviation increase in x. These are most useful for interpreting effects when neither predictors nor outcomes have natural scales and when gauging relative effect sizes across predictors on different scales. **For covariance parameters, fully standardized estimates can be interpreted as correlations**.
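The last point can be sketched directly: standardizing a covariance by the two standard deviations yields a correlation (hypothetical numbers; Python used only for illustration):

```python
# Why a fully standardized covariance is a correlation: dividing a
# covariance by the two standard deviations rescales both variables to
# unit variance. Numbers are hypothetical.
def standardize_cov(cov_xy, var_x, var_y):
    return cov_xy / (var_x ** 0.5 * var_y ** 0.5)

print(standardize_cov(6.0, 4.0, 16.0))  # 0.75
```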

From the R-Square section of output, we can also see that only a modest amount of the total variance in any of these variables has been explained by the model.

```{r}
resid(fit, type="normalized")
```

The normalized residuals are rescaled to follow a standard normal distribution, and values exceeding plus or minus two are often taken as potentially meaningful in magnitude. The normalized residuals for this model suggest that the hypothesized structure is doing a **poor** job **of reproducing** the observed covariances between several variables, most notably between **age and peer, between age and negaff, and between stress and peer**. We will explore methods for more formally assessing overall model fit in the next chapter.

##### 1.4.2 Evaluation and re-specification

We obtain measures of fit, appropriately enough, by including **fit.measures=TRUE**
```{r}
summary(fit, fit.measures=TRUE)
```

We can request that lavaan suggest changes to our model based on the expected change in model chi-square if a fixed parameter were freed; these are the modification indices (MIs). We can obtain MIs with the modindices function. The sort. argument is set to TRUE so that lavaan will present the largest MIs first. Additionally, the minimum.value argument is set to suppress printing of any MIs less than 10.

```{r}
modindices(fit, sort.=TRUE, minimum.value=10)
```
Here, lhs stands for “on the left-hand side of the operator” and rhs stands for “on the right-hand side of the operator.” The operator, denoted op, for these parameters is ~, indicating these are regression slopes. **mi** denotes the expected **improvement in the model chi-square test statistic** if the modification is accepted. epc denotes the “expected parameter change (EPC)” and gives the expected parameter value if the modification is implemented (since it is changing from zero). sepc.lv re-scales the EPC by standardizing any latent variables in the model, sepc.all re-scales the EPC by standardizing all variables (observed and latent), and sepc.nox re-scales the EPC by standardizing all variables except exogenous predictors. Here, sepc.lv is equivalent to the EPC because there are no latent variables in the model. It can be helpful to rely on standardized EPC values in order to get a sense of the relative magnitude of each potential modification.

**Final Model**

![](figure/path5.jpg){width=50%}

```{r}
pathmodel.5 <-'
              #regressions
              stress ~ coa + gen + age
              emotion ~ coa + gen + age
              negaff ~ stress + emotion + age
              peer ~ negaff + age + coa

              #covariances
              stress ~~ emotion              
              '
fit.5 <- sem(pathmodel.5, data=afdp, meanstructure=TRUE)
summary(fit.5, fit.measures=TRUE)
```

##### 1.4.3 Mediation Test

A mediator explains why or how one variable exerts an effect on another. The indirect effect **point estimate** is the **product of the coefficients** along the causal chain; its CI is typically calculated by the **delta method** or by **bootstrapping**.
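The product-of-coefficients estimate and its delta-method (Sobel) CI can be sketched in a few lines of base R; the coefficients and standard errors below are hypothetical, not taken from the afdp model:

```{r}
# Delta-method (Sobel) CI for an indirect effect a*b.
# All coefficient values here are hypothetical, for illustration only.
a  <- 0.40; se_a <- 0.10   # path X -> M
b  <- 0.30; se_b <- 0.08   # path M -> Y (controlling for X)

ab    <- a * b                                  # indirect effect point estimate
se_ab <- sqrt(a^2 * se_b^2 + b^2 * se_a^2)      # first-order delta-method SE
ci    <- ab + c(-1, 1) * qnorm(0.975) * se_ab   # normal-theory 95% CI

round(c(estimate = ab, se = se_ab, lower = ci[1], upper = ci[2]), 4)
```

Because the sampling distribution of a product of coefficients is skewed, bootstrapped CIs (shown below with lavaan) are generally preferred over this normal-theory interval.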

**Significant links in a mediational pathway are not sufficient to infer mediation**. In order to formally test the full mediation effect, we need an inferential test of the entire specific indirect effect in question. Here, we want to know whether the specific mediational pathway of **COA status on deviant peer affiliation via stressful events and negative affect** is statistically significant, whether the specific mediational effect of **coa on peer via emotionality and negative affect** is significant, and whether the **overall indirect effect of coa on peer** is significant. Finally, we would like to have an estimate of the **total effect** of coa on peer considering both direct and indirect pathways.

We’ve added **parameter labels** for the paths involved in the **direct and indirect effects of coa on peer**. For instance, in the line stress ~ c_s*coa + gen + age, we have supplied the label c_s for the effect of coa on stress. Likewise, the other paths involved in computing the effects of interest have also been labeled. To help keep track of things, our labels consist of the first letter of the predictor, an underscore, and the first letter of the outcome, but any labels are fine. We can now refer to these labels when defining direct, indirect, and total effects.

Following the core model specification, we have new lines to define the direct, specific indirect, total indirect, and total effects. This is done using the **“define” operator :=**. For instance, ind_s := c_s*s_n*n_p requests that lavaan compute the quantity ind_s as the product of the three paths previously labeled c_s, s_n, and n_p.

```{r}
effects.5 <-'
            #regressions
            stress ~ c_s*coa + gen + age
            emotion ~ c_e*coa + gen + age
            negaff ~ s_n*stress + e_n*emotion + age
            peer ~ n_p*negaff + age + c_p*coa 

            #covariances
            stress ~~ emotion         

            #direct effect
            dir := c_p

            #specific indirect effects
            ind_s := c_s*s_n*n_p
            ind_e := c_e*e_n*n_p

            #total indirect effect
            tot_ind := c_s*s_n*n_p + c_e*e_n*n_p

            #total effect
            tot := c_s*s_n*n_p + c_e*e_n*n_p + c_p
            '
set.seed(62973)
fit.5b <- sem(effects.5, data=afdp, meanstructure=TRUE, se="bootstrap", bootstrap=1000)
```

Note that we have used the set.seed function to set the random number seed for selecting the bootstrapped samples so that we can reproduce exactly these same results each time we run the script.

Then, in the sem function, we specified se="bootstrap", bootstrap=1000 to request that lavaan compute **bootstrapped standard errors** with 1000 bootstrapped samples. We actually don’t care much about the standard errors. What we really want are bootstrapped confidence intervals, which are generated by the parameterEstimates function with the boot.ci.type argument. The parameterEstimates function produces a simple list of the parameter estimates from the model. By including the boot.ci.type argument, we request that this output include 95% bootstrapped confidence intervals.

Note that there exist **multiple methods** for calculating **bootstrapped CIs**. The **percentile** method is probably the easiest to understand: the 1000 bootstrapped estimates are ordered by magnitude, and the 25th estimate (2.5th percentile) and 975th estimate (97.5th percentile) are taken as the confidence limits, yielding a 95% confidence interval. This method, which is widely available in many software programs, is obtained via boot.ci.type = "perc". That is the method we used in the lecture notes. A slightly preferable approach, however, is to use bias-corrected bootstrapped confidence intervals, and these are readily available in lavaan. To obtain the **bias-corrected** confidence intervals, we simply use boot.ci.type = "bca.simple". The bias-corrected intervals are shown in the output below. In terms of null hypothesis testing, the same conclusions are generated with the bias-corrected CIs as with the percentile CIs presented in lecture, but there are some slight differences in the specific values of the confidence limits.
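For intuition, the percentile method can be reproduced by hand in base R on simulated mediation data (the variables x, m, and y below are illustrative, not from the afdp data):

```{r}
# Percentile bootstrap CI for an indirect effect a*b, built by hand.
# Simulated chain x -> m -> y; all names and effect sizes are illustrative.
set.seed(62973)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + rnorm(n)
dat <- data.frame(x, m, y)

ab_hat <- function(d) {
  a <- coef(lm(m ~ x, data = d))["x"]       # path x -> m
  b <- coef(lm(y ~ m + x, data = d))["m"]   # path m -> y, controlling for x
  unname(a * b)
}

# Resample rows with replacement, re-estimate a*b in each bootstrap sample
boots <- replicate(1000, ab_hat(dat[sample(n, replace = TRUE), ]))
quantile(boots, c(0.025, 0.975))            # percentile 95% CI
```

If this interval excludes zero, the indirect effect is deemed significant, exactly the logic applied to the lavaan output below.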
```{r}
parameterEstimates(fit.5b, boot.ci.type = "perc")
parameterEstimates(fit.5b, boot.ci.type = "bca.simple")
```

The total effect of coa on peer is equal to .209 and represents a combination of the direct effect (.185) and the total indirect effect (.024). The 95% CI is equal to .094 and .306; because this does not contain zero, the total effect is deemed to be significant.

Examining the mediational pathways, we see that the specific indirect effect coa→stress→negaff→peer, labeled **ind_s** to indicate it is the indirect effect through stress, is equal to .015 (95% CI=.005, .036) and is **significant** (does not include zero). The biological pathway from coa→emotion→negaff→peer, labeled **ind_e** to indicate it is the indirect effect through emotion, is equal to .009 (95% CI=0, .022) and thus does **not reach statistical significance** (because the lower CI is equal to zero).


### 2. Factor Analysis
#### EFA
**Latent variables are also called**: unmeasured variables, constructs, factors, true scores, unobserved variables, unobserved heterogeneity. Latent variables are **used for**: factors in factor analysis, latent variables in structural equation models, basis curves in latent curve analysis, latent classes (class/cluster/mixture models), latent traits, underlying propensity variables for censored/limited dependent variables, missing data, **random effects in a multilevel or mixed model**.

Goals of Factor Analysis: determine the underlying structure behind a set of observed measures.

Exploratory factor analysis is primarily data-driven: **all observed variables are regressed on all factors**, and all of the regression slopes (loadings) are freely estimated. Confirmatory factor analysis formally tests a theoretical factor structure: the number and nature of the latent variables, and the relationships between indicators and latent variables, are specified by theory.

**An outcome variable is regressed on the predictors.** Because the predicted outcome y depends on the predictor x, you can read "is regressed on" as "is dependent on". The word "regressed" is used instead of "dependent" to emphasize that a regression technique represents this dependency between x and y.

![](figure/EFA3.jpg){width=50%}
![](figure/EFA2.jpg){width=40%}

$\Theta$ is the residual covariance matrix

**Communality**. $R^2$ for regression of the indicator on the set of common factors: items with high communalities are good items, as most of their variance reflects the influence of the common factors
$$
h^2=1-\frac{VAR(\varepsilon)}{VAR(y)}
$$
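A quick numeric illustration of this formula for a single indicator in a one-factor model, with hypothetical parameter values:

```{r}
# Communality of one indicator y = lambda * eta + e in a one-factor model.
# All parameter values below are hypothetical.
lambda <- 0.8                       # factor loading
psi    <- 1.0                       # factor variance
theta  <- 0.36                      # residual variance VAR(e)

var_y  <- lambda^2 * psi + theta    # implied VAR(y) = 1.0
h2     <- 1 - theta / var_y         # communality: share of common variance
h2                                  # 0.64, i.e. 64% of item variance is common
```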

Conducting Process

1. Determine number of latent factors
2. Rotate factor pattern matrix at chosen number of factors
3. Eliminate problematic items and refit the model (items with large cross-loadings or with low communalities; redundant items may otherwise hijack a factor)
4. Inspect final pattern matrix to determine meaning of latent factors
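The steps above can be sketched with base R's factanal function on simulated two-factor data (a minimal illustration under assumed values, not the analysis from these notes):

```{r}
# EFA sketch with stats::factanal on simulated data: two latent factors
# each drive three of six items. All loadings and names are illustrative.
set.seed(123)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)
items <- cbind(
  v1 = 0.8 * f1 + rnorm(n, sd = 0.6), v2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  v3 = 0.6 * f1 + rnorm(n, sd = 0.8), v4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  v5 = 0.7 * f2 + rnorm(n, sd = 0.7), v6 = 0.6 * f2 + rnorm(n, sd = 0.8))

# Steps 1-2: fit a 2-factor model and apply an oblique (promax) rotation
efa <- factanal(items, factors = 2, rotation = "promax")
print(efa$loadings, cutoff = 0.3)   # step 4: inspect the pattern matrix

# Step 3: communalities (1 - uniquenesses) flag weak items for removal
round(1 - efa$uniquenesses, 2)
```

Items with small communalities or sizable cross-loadings would be dropped and the model refit before interpreting the factors.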

#### CFA

![](figure/CFA1.jpg){width=30%} ![](figure/CFA2.jpg){width=30%} ![](figure/CFA3.jpg){width=30%}

Setting the scale of latent variables

1. set means and variances of the latent variables to zero and one
    + puts latent variables on a standardized scale
    + allows for estimation of all factor loadings (and intercepts)
    + same loading scaling option that is used in EFA

2. give each latent variable the same metric as one of the observed variables. The intercept and loading for the "scaling item" are set to zero and one, respectively

**CFA Example: Dataset Description**
The data for this demonstration were provided by Holzinger & Swineford in their 1939 monograph A Study in Factor Analysis: The Stability of a Bi-Factor Solution. The sample includes 301 7th and 8th grade students, between 11-16 years of age, drawn from two schools. The data is in the text file hs.dat and the accompanying R-script file is ch04.R. The variables in the data set that we will use are

- **visperc**: visual perception test in which participants select the next image in a series
- **cubes**: visual perception test in which participants must mentally rotate a cube
- **lozenges**: visual perception test involving mental “flipping” of a parallelogram (“lozenge”)
- **parcomp**: paragraph comprehension test
- **sencomp**: sentence completion task in which participants select the most appropriate word to put at the end of a sentence
- **wordmean**: verbal ability test in which participants must select a word most similar in meaning to a word used in a sentence
- **addition**: participants have 2 minutes to complete as many 2-number addition problems as they can
- **countdot**: participants have 4 minutes to count the number of dots in each of a series of dot pictures
- **sccaps**: participants have 3 minutes to indicate whether capital letters are composed entirely of straight lines or include curved lines

Other variables in the data not included in the models fit here are school, female, age, and month.

```{r}
hs <- read.table("data/hs.dat", header=FALSE) 

names(hs) <- c("school", "female", "age", "month", "visperc", "cubes", "lozenges", "parcomp", "sencomp", "wordmean", "addition", "countdot", "sccaps")
```

We rescaled the variables addition, countdot, and sccaps by dividing by four. This was done to facilitate model estimation, as numerical instability problems can arise when using variables on widely differing scales. Dividing these variables by four brings their standard deviations closer to the standard deviations of the other observed variables.

```{r}
hs$addition <- hs$addition/4 
hs$countdot <- hs$countdot/4 
hs$sccaps <- hs$sccaps/4
```

Finally, we use the lavaan package (loaded above) to fit the CFA models:

![](figure/CFA_eg.jpg){width=40%} 

The cfa function in lavaan imposes a number of defaults consistent with how CFA models are typically specified and enables us to specify models with compact model syntax objects. The default scaling for the latent variables is a hybrid of the approaches we described in lecture. By default, the **latent variables will have means set to zero** but freely estimated variances. Additionally, the **factor loading of the first indicator for each latent variable will be set to one** but the intercept for this indicator will be freely estimated. We will override these defaults to implement the scaling options described in lecture. In defining the model syntax object, we use the =~ operator, which defines a latent variable (left-hand side) and specifies the variables that load on it (right-hand side). Thus, visual =~ visperc + cubes + lozenges defines a new latent variable, visual, and indicates that visperc, cubes, and lozenges are its indicator variables.

```{r}
cfa.1a <-'#factor loadings
          visual =~ visperc + cubes + lozenges
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
         '

fit.1a <- cfa(cfa.1a, data=hs, meanstructure=TRUE, std.lv=TRUE)
summary(fit.1a, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)
```
The modification indices for the model are obtained from the modindices function as follows:
```{r}
modindices(fit.1a, sort.=TRUE, minimum.value=10)
```

Here we see the largest modification indices are associated with a cross-loading for sccaps on visual, and a correlated uniqueness for countdot with addition. It is important to keep in mind that both modification indices may be related to the same misspecification.


**Initial Model with Scaling Items**: the model fit is identical to the standardized-scaling version.

We shall choose the first indicator for each factor to be the scaling item. The intercept and factor loading for each scaling item is set to zero and one, respectively. This scaling option permits the means and variances of the latent factors to be estimated.


```{r}
cfa.1b <-'#factor loadings
          visual =~ visperc + cubes + lozenges
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
     
          #means/intercepts
          visual ~ 1
          verbal ~ 1
          speed ~ 1
          visperc ~ 0*1
          parcomp ~ 0*1
          addition ~ 0*1
         '

fit.1b <- cfa(cfa.1b, data=hs, meanstructure=TRUE)
#summary(fit.1b, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)
```

![](figure/CFA_m.jpg){width=40%}

```{r}
cfa.2 <- '#factor loadings
          visual =~ visperc + cubes + lozenges + sccaps
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps
         '
fit.2 <- cfa(cfa.2, data=hs, meanstructure=TRUE, std.lv=TRUE)
#summary(fit.2, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)

lavTestLRT(fit.2, fit.1a)
```

```{r}
modindices(fit.2, sort.=TRUE, minimum.value=10)
```
Note that no further changes are suggested for the model.

![](figure/CFA_m2.jpg){width=40%}
 
```{r}
cfa.3 <- '#factor loadings
          visual =~ visperc + cubes + lozenges 
          verbal =~ parcomp + sencomp + wordmean
          speed =~ addition + countdot + sccaps

          #covariances
          addition ~~ countdot
         '
fit.3 <- cfa(cfa.3, data=hs, meanstructure=TRUE, std.lv=TRUE)
#summary(fit.3, fit.measures=TRUE, standardized=TRUE, rsquare=TRUE)

lavTestLRT(fit.3, fit.1a)

modindices(fit.3, sort.=TRUE, minimum.value=10)
```

This model is not nested with the alternative modified model; however, the two models provide nearly equivalent fit (RMSEA=.065 versus .066, TLI=.948 versus .946, etc.). Thus, model selection should be based on which modification is most plausible, and upon the interpretability of parameter estimates.



### 3. LSEM

1. the measurement model is identical to CFA $y_i= v +\Lambda\eta_i + \varepsilon_i$
2. the structural model is similar to path analysis $\eta_i=\alpha + B\eta_i + \Gamma x_i + \zeta_i$

Measurement invariance (MI) refers to the extent to which the same measurements, conducted under different conditions, yield the same measures of the attributes under study. Longitudinal invariance is said to exist if the distribution of the observed variables depends only on the level of the latent variable and not also on the time point at which it was measured: $f(y|\eta,t)= f(y|\eta)$

The latent growth curve captures growth as a latent factor. The latent curve model (LCM) is fundamentally a highly restricted CFA model. The reduced-form equation is $y_i= v +\Lambda(\alpha+\zeta_i) + \varepsilon_i$
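The reduced form implies the model moments $\mu(\theta)=v+\Lambda\alpha$ and $\Sigma(\theta)=\Lambda\Psi\Lambda'+\Theta_\varepsilon$, which can be computed directly in base R. This sketch uses hypothetical parameter values for a four-occasion linear LCM:

```{r}
# Model-implied moments of a linear latent curve model with 4 occasions,
# computed from the reduced form y = v + Lambda(alpha + zeta) + e.
# All parameter values below are hypothetical.
Lambda <- cbind(intercept = 1, slope = 0:3)   # fixed loading matrix
nu     <- rep(0, 4)                           # intercepts fixed to zero (scaling)
alpha  <- c(10, 2)                            # factor means: start level, growth rate
Psi    <- matrix(c(4.0, 0.5,
                   0.5, 1.0), 2, 2)           # factor variances/covariance
Theta  <- diag(2.0, 4)                        # residual variances

mu_theta    <- nu + Lambda %*% alpha                  # implied means: 10, 12, 14, 16
Sigma_theta <- Lambda %*% Psi %*% t(Lambda) + Theta   # implied covariance matrix
mu_theta
Sigma_theta
```

Estimation then amounts to choosing $\theta$ so that these implied moments best reproduce the sample moments S and m.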

![](figure/LCM1.jpg){width=50%} ![](figure/LCM2.jpg){width=40%}


![](figure/LCM_q.jpg){width=40%}  ![](figure/LCM_p.jpg){width=40%}

**Time-invariant predictors, predicting both intercept and slope**, change the structural model: add $\Gamma x_i$ to it, giving $y_i= v +\Lambda(\alpha+\Gamma x_i+\zeta_i) + \varepsilon_i$

![](figure/LCM3.jpg){width=40%} ![](figure/LCM4.jpg){width=40%}
While time-invariant covariates add predictors to the structural model, time-varying covariates add predictors to the measurement model: add $Kz_i$, giving $y_i= v +\Lambda(\alpha+\zeta_i) +Kz_i+ \varepsilon_i$

![](figure/LCM5.jpg){width=40%} ![](figure/LCM6.jpg){width=40%}

### 4. MLM
![](figure/mixed_model.jpg){width=100%}
The lme function from the nlme package is flexible in specifying the autocorrelation structure
```{r}
summary(lme(AIQ ~ 1 + SZ + T1 + T2 + T3 + SZ:T1 + SZ:T2 + SZ:T3,
            random = ~ T1 + T2 | ID, data = cog_pre_long, na.action = na.omit,
            correlation = corCompSymm(form = ~ 1 | ID), method = "ML",
            control = list(msMaxIter = 1000, msMaxEval = 1000)))
```
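Because the cog_pre_long data set is not shown here, a self-contained sketch with simulated data illustrates swapping in an AR(1) residual correlation structure (all variable names and values are illustrative):

```{r}
# Sketch of lme's flexible residual correlation: an AR(1) structure fit to
# simulated longitudinal data with random intercepts. Names are illustrative.
library(nlme)
set.seed(1)
n_id <- 60; n_t <- 4
d <- data.frame(ID   = factor(rep(1:n_id, each = n_t)),
                time = rep(0:(n_t - 1), n_id))
b0i  <- rnorm(n_id)[d$ID]                     # subject-specific intercepts
d$y  <- 10 + 1.5 * d$time + b0i + rnorm(nrow(d))

fit_ar1 <- lme(y ~ time, random = ~ 1 | ID, data = d,
               correlation = corAR1(form = ~ time | ID), method = "ML")
summary(fit_ar1)$tTable
```

Other corStruct classes (e.g., corCompSymm as above, or corSymm for an unstructured pattern) can be substituted in the correlation argument without changing the rest of the call.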

lmer {lme4}
```{r}
summary(lmer(AIQ ~ 1 + SZ + T1 + T2 + T3 + SZ:T1 + SZ:T2 + SZ:T3 + (1 + T1 + T2 | ID), data = cog_dat_long))
```

